31.3 Knowledge Representation
Most obviously, knowledge representation is a medium of human expression, typically
a language. In bioinformatics, the representation should be chosen to assist
computation; for example, the attributes of an object being optimized using
evolutionary computation (Sect. 4.3) have to be encoded in the (artificial)
chromosome. In the case of binary encoding, it may be sufficient to represent the
presence of an attribute by “1” and its absence by “0”.
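As a minimal sketch of such a binary encoding, the following Python fragment maps a
set of attributes onto a bit string and back. The attribute names and the helper
functions encode and decode are hypothetical, chosen only for illustration; they are
not taken from any particular evolutionary computation library.

```python
from typing import List, Sequence, Set

# Hypothetical attribute list; the order fixes the position (locus) of each bit.
ATTRIBUTES = ("signal_peptide", "transmembrane_helix", "kinase_domain", "zinc_finger")


def encode(attributes: Sequence[str], present: Set[str]) -> List[int]:
    """Return a binary chromosome: 1 if the attribute is present, 0 if absent."""
    return [1 if name in present else 0 for name in attributes]


def decode(attributes: Sequence[str], chromosome: Sequence[int]) -> Set[str]:
    """Recover the set of present attributes from a binary chromosome."""
    if len(attributes) != len(chromosome):
        raise ValueError("chromosome length must match the number of attributes")
    return {name for name, bit in zip(attributes, chromosome) if bit == 1}


chromosome = encode(ATTRIBUTES, {"kinase_domain", "signal_peptide"})
print(chromosome)  # [1, 0, 1, 0]
```

Because each attribute occupies a fixed position, the usual operators of evolutionary
computation (mutation as a bit flip, crossover as an exchange of segments) act
directly on the list.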
Ideally, the representation should provide a guide to the organization of
information; indeed, knowledge might be defined as “organized (structured)
information”. Thus, the ontologies discussed in the previous section are an attempt to
represent knowledge in this spirit. The most desirable kind of organization is that
which facilitates making inductive inferences—and this will be most successfully
achieved if as few preconceptions as possible are imposed on the organization.
Powerful ways of representing knowledge need not involve words, or symbolic
strings, at all. Visualization (cf. Sect. 13.4) may be much more revealing than a verbal
description. A particular advantage is the possibility of rearranging material in two
dimensions rather than one. In this regard, languages based on ideographs, most
notably Chinese, would appear to be very powerful, since concepts can be rearranged
on a sheet of paper and novel juxtapositions can be freely generated.
As knowledge becomes more and more complex, good examples of which are
the organization of living organisms (Fig. 14.1) and their regulation (e.g., Fig. ??),
novel ways of representing it need to be creatively explored. One approach that may
prove useful is to represent knowledge as probability distributions, conditional upon
more or less certain facts emanating from observations or laboratory experiments;
as more data becomes available, inferences can then be continuously updated in a
far more systematic manner than is done at present.
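One simple instance of this idea, assuming that the uncertain quantity is the
probability of a binary outcome (for example, whether a gene is found to be expressed
under a given condition), is the conjugate beta-binomial update. The sketch below is
illustrative only: the class name BetaBelief, the uniform prior and the observation
counts are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about an unknown probability, held as a Beta(alpha, beta) distribution."""

    alpha: float = 1.0  # assumed uniform prior: Beta(1, 1)
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> "BetaBelief":
        """Return the posterior after observing new binary outcomes."""
        return BetaBelief(self.alpha + successes, self.beta + failures)

    @property
    def mean(self) -> float:
        """Expected value of the unknown probability under the current belief."""
        return self.alpha / (self.alpha + self.beta)


prior = BetaBelief()
after_first = prior.update(successes=7, failures=3)         # hypothetical first experiment
after_second = after_first.update(successes=2, failures=8)  # hypothetical second experiment

print(after_first, after_second)
# Updating batch by batch gives the same belief as pooling all the data at once.
print(after_second == prior.update(successes=9, failures=11))  # True
```

The knowledge is here represented by the pair (alpha, beta) rather than by a single
point estimate, so each new batch of observations refines the distribution without
the earlier data having to be revisited.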
31.4 Data Mining
The goal of data mining is usually stated as finding meaningful new patterns from a
mass of more or less unstructured data (the ore in the mining analogy, a great part of
which will be discarded as gangue). In a nutshell, it is the process of analysing large
datasets to discover patterns and insights. It involves applying algorithms and
statistical methods to identify relationships and correlations between different variables.
It is hoped that data mining can be used to uncover trends unperceived by a human
observer. Hence, it is sometimes called knowledge discovery in databases (KDD).
The primary motivation is the vast accumulation of data from high-throughput
technologies, including nucleic acid sequencing and microarrays. There is an underlying
notion that “knowledge” or “meaning” can be self-revealing; depending on the
definitions of these terms (cf. Chap. 6) this goal may be illusory, much like the notion of